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Randomized Structural Sparsity based Support 
Identification with Applications to Locating 
Activated or Discriminative Brain Areas: A 
Multi-center Reproducibility Study 

Yilun Wang, Sheng Zhang, Junjie Zheng, Heng Chen, and Huafu Chen, 


Abstract 

In this paper, we focus on how to locate the relevant or discriminative brain regions related with 
external stimulus or certain mental decease, which is also called support identihcation, based on the 
neuroimaging data. The main difficulty lies in the extremely high dimensional voxel space and relatively 
few training samples, easily resulting in an unstable brain region discovery (or called feature selection in 
context of pattern recognition). When the training samples are from different centers and have between- 
center variations, it will be even harder to obtain a reliable and consistent result. Corresponding, we 
revisit our recently proposed algorithm based on stability selection and structural sparsity. It is applied 
to the multi-center MRI data analysis for the hrst time. A consistent and stable result is achieved across 
different centers despite the between-center data variation while many other state-of-the-art methods such 
as two sample t-test fail. Moreover, we have empirically showed that the performance of this algorithm 
is robust and insensitive to several of its key parameters. In addition, the support identihcation results 
on both functional MRI and structural MRI are interpretable and can be the potential biomarkers. 
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Index Terms 
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fMRI; sMRI; Multimodal; pattern recognition 

I. Introduction 

A. Problem Statement 

For computer multimedia analysis, a challenging problem is the semantic gap between the high-level 
perception and cognition and the low-level features of the digital contents. Considering that the human 
brain can eliminate this gap very naturally, deep understanding of the brain responses to multimedia will 
be playing an important role in designing the computational strategies for multimodal representation, 
classification and retrieval. They share a fundamental problem with many other brain related fields 
such as clinical diagnosis of mental diseases. That is how to reliably locate the relevant brain regions 
corresponding to either external multimedia stimulus or certain mental disease. Via modern brain imaging 
techniques such as functional magnetic resonance imaging (fMRI), electroencephalography (EEC) and 
magnetoencephalography (MEG), ones can image the structure, function or pharmacology of the nervous 
system in a non-invasive way. Recently, learning from neuroimaging data, as a kind of pattern recognition, 
has led to impressive results H]], such as guessing which image or video a subject is looking at or where 
are the most activated brain regions when one responses to multimedia stimulus in brain disorders. In 
many such applications, the commonly and reproducibly selected brain areas can be considered as the 
potential biomarkers lO, lO, IH, Q. Without reliable, reproducible results, no study in generally can 
effectively contribute to scientific knowledge |6l. 

Discovering the discriminative brain regions corresponding to certain mental decease or activated brain 
regions corresponding to certain external stimuli, also called support identification, is a typical feature 
selection problem in the context of pattern recognition. However, due to the high dimensionality of 
feature space and relatively few samples, it is quite a challenging task to accurately and stably estimate 
discriminative voxels Q. Many existing multi-variable pattern analysis methods often fail to provide 
a stable feature selection result, i.e., there exists the inherent significant run to run variability in the 
decision space generated by the classifiers in terms of even very small changes to the training set HI. 
This instability is partially because classic multivariate feature selection methods aim at selecting a 
minimum subset of features to construct a classifier of the best predictive accuracy and often ignore 
“stability” in the algorithm design lO, lITOl . ifTTI . ifT^ . IH, ifTSl . 
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Therefore, we will focus more on the stability and reliahility of the feature selection results, for the sake 
of scientific truth. Correspondingly, our research will he based on multi-center neuroimage data analysis 
because multi-center data allows to test the reproducibility and consistency of the feature selection results. 
In addition, a multi-center study offers several advantaged over a single center study. For example, it has 
the potential to increase the number and diversity of subjects enrolled, and to include significant numbers 
of subjects from rare clinical subgroups llT4l . While inter-center variability brings great challenges for 
consistent feature selection results, it can also be used to test the performance of different algorithms. 
Finally, we would like to emphasize that in line with the trend towards Big Data in brain study, major 
advances have been made in the availability of shared neuroimaging data. There have existed more than 
8,000 shared MRI (magnetic resonance imaging) data sets available online lITSll . The quantitative analysis 
of these multicenter data sets from related magnetic resonance imaging experiments has the potential to 
significantly accelerate progress in brain mapping. 

In order to further increase the trustworthiness of the discovered biomarkers, we consider both func¬ 
tional MRI (fMRI, for short) and structural MRI (sMRI, for short). While the results from functional 
imaging are increasingly being submitted as evidence into the legal systems of many nations, the reliability 
of the evidence from fMRI still needs to further proved IQ. Therefore, we also consider the structural 
MRI data, which is generally considered to be less dynamically changed. The biomarker results from 
both fMRI and sMRI will be considered together to give a better and more convincing understanding of 
the decease, i.e. Autism Spectrum Disorder (ASD) in this paper lIT^ . ifTTl . 

To our best knowledge, most of existing algorithms fail to overcome this difficulty caused by data 
variation. For example, the univariate approaches have been popular for finding fhe discriminative regions 
due to their being directly testable, easily interpretable, and computationally tractable. However, it is 
not quite robust to the data perturbation possibly caused by the between-center variation. In addition, 
there exists a strong demand in various multivariate approaches in recent years HU, because the brain 
representation is intrinsically multivariate. However, as mentioned before, many multivariate methods 
often ignore “stability” in terms of feature selection. 

We aim to achieve a consistent feature selection result across different sites. Correspondingly, we 
will revisit our recently proposed concept of “randomized structural sparsity” and apply it to the multi¬ 
center data analysis. We will verify the stability of the resulted algorithm in terms of the generalization 
and reproducibility of the selected features via the multi-center data IT9l . In particularly, we will check 
whether a consistent feature selection result can be obtained across different centers, and this consistency 
will support the reliability of the selected features as the potential true biomarker of the associated disease. 
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i.e. ASD here. 

B. Advantages and Limitations of Sparsity Applied to Nueroimaging 

While most of existing efforts to deal with multi-center data is to merge the data from different sites into 
a pool and analyze it as a whole, like a hig single center, we consider each of multiple centers individually 
ll20l . In particular, we consider each of the centers separately and apply the feature selection method 
to obtain the potential hiomarker. Then we compare the results from different centers to see whether 
they are consistent to a certain degree. While this seems to he a natural and better way to validate the 
reproducibility of the final selected biomarkers than only considering a single center data, it brings a great 
challenge to the feature selection method, which is required to be stable due to the possible high data 
variability across different centers. Now we first review some commonly used feature selection methods 
in the following parts. 

We consider to identify the discriminative brain voxels from the given labeled training MRI data in 
context of the supervise learning. While the classification problem is mainly considered, the regression 
problem can be treated in a similar way. The linear model is widely used for feature selection as follows. 

y = Xw + e (1) 

where y G is the binary classification information and X € is the given training voxel-wise 

MRI data and w G is the unknown weights reflecting the degree of importance of each voxel. 

Identification of discriminative voxels is based on the values of the weight vector w and their importance 
is usually proportional to the absolute values of the components. 

Considering that the common challenge in this field is the curse of dimensionality p ^ n. sparsity 
makes much sense that the most discriminative activated voxels are only a small portion of the total 
brain voxels m. However, sparsity alone is not sufficient for making reasonable and stable inferences, 
in cases of very high voxel space and small number of training samples. In such cases, the plain sparse 
learning models such as the ^i-norm regularized model also called LASSO ll22l . often provide overly 
sparse and hard to interpret solutions Il23l . Thus we need to extend the plain sparse learning model to 
incorporate some important structural features of brain imaging data in order to achieve a more stable, 
reliable and interpretable support identification results. 

C. Existing Extensions of the Plain Sparse Model 

Functional segregation of local brain areas that differ in their anatomy and physiology contrasts sharply 
with their global integration during perception and behavior E^ . Il25l . and sets of discriminative or 
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activated voxels between different brain states are expected to form distributed and localized areas Il26l . 
HH, ||271. Therefore, two common hypothesis have been made for MRI data analysis. One is the sparsity: 
the discriminative or activated voxels implied in the classification task or an external stimulus are only 
a small portion of the total; the other is compact structure: these voxels are often grouped into several 
distributed clusters because those within a cluster have similar behaviors. Notice that the plain sparse 
learning model based on the li regularization is only making use of the first sparsity hypothesis and 
discard the structure related assumption. 

Elastic Net |[28l tries to make use of the voxel correlation by adding an I 2 regularization or called 
Tikhonov regularization to the classical li penalty. The strict convexity due to the help of the added £2 
term helps to select all together a group of redundant features or none of them in what is known as the 
grouping effect Il29l . |[30l . Recently, Total-Variation (TV) penalization was added to both penalization 
for voxel selection ifSTl . The TV penalization is used to make use of the assumption that the activations 
are spatially correlated and the weights of the voxels are close to piece-wise constant. 

Furthermore, structured sparsity models are proposed to more explicitly make use of the segregation 
and integration of the brain |[32ll . in order to extend the well-known plain models. They enforce more 
structured constraints on the solution, such as the discriminative voxels are grouped together in possibly 
few clusters 13^ . 1341 . Il35l . If36l . Il37l . |[38l . Il39l . Il40l . In many cases, the parcellation information is 
not available beforehand, and therefore, ones can either use the anatomical regions as an approx iHTl . 
or use the data driven methods to obtain the grouping information H2l . Il43l . Notice that the kind of 
parcellation based on certain definition of homogeneousness may not exactly match the ground truth of 
the discriminative voxels. However, most structural sparsity models tend to select either all voxels of a 
group or none of them, and fail to perform further refinements by selecting part of voxels of a group. 

Another important kind of feature selection methods for high dimensional data analysis is stability 
selection Il44l . Il45l . Il46l . which is based on resamplings (bootstrapping would behave similarly) and 
aims to alleviate the disadvantage of the plain £1 model, which either selected by chance non-informative 
regions, or neglected relevant regions that provide duplicate or redundant classification information 1471 . 
||48]1 . im, ll49l . |[50l . ll5ll . |[52l . One main advantages of stability selection is the finite sample control of 
false positives. It also makes the choice of sparsity penalty parameter much less critical or sensitive Q. 
However, it fails to explicitly make use of the assumption that these targeted voxels are often spatially 
contiguous and form into distributed brain regions. Correspondingly, we proposed so called “randomized 
structural sparsity” based feature selction algorithm in |[5^ . by incorporating the idea of structural sparsity 
into stability selection. 


June 9, 2015 


DRAFT 


JOURNAL OF UTbX CLASS FILES, VOL. 11, NO. 4, DECEMBER 2012 


6 


D. Our focus and contributions 

In this paper, we consider to achieve a consistent and reproducihle feature selection result. We revisit 
the “randomized structural sparsity” based feature selection method we recently proposed in 1531, where 
the finite sample control of false positives and false negatives are considered on a single-center study. 
Here we demonstrate its reproducibility and stability of the selected biomarkers verified through the 
multi-center data analysis. We make a further analysis of this “randomized structural sparsity” based 
algorithm, by considering the sensitivity of several key parameters of our algorithm such as the number 
of the clustering, the block size of block samplings, for the first time. In addition, we proposed a simple 
but effective heuristic way for setting the threshold value, which is used to filter out the uninformative 
features. Finally, we consider both structural MRI and functional MRI modalities. The revealed potential 
biomarkers from these two modals will be put together for better understanding of the ASD than the 
single modality. 

The rest of the paper is organized as follows. In section ini we revisit our recent algorithm for stable 
voxel selection. In section JIIJ we demonstrate the advantages of our algorithm on the real multi-center 
neuroimage data in terms of better consistency of selected discriminative voxels across different centers. 
In section |IVj a short summary of our work and some possible future research directions will be given. 

II. Randomized Structural Sparsity Applied to Multi-Center Data Analysis 
A. The Background and Motivation 

We focus more on how to obtain a reliable support identification in context of many fields including 
mental decease or external stimuli including multimedia recognition. This is not a trivial task. In this 
paper, we revisit the Randomized Structural Sparsity (RSS, for shot) based algorithm we proposed in 
m, and aim to demonstrate that this algorithm can achieve a stable and consistent feature selection 
result even in the existence of data variations. 

MRI data is profoundly affected by experimental and methodological factors. There is a high likelihood 
of introducing undesirable inter-site variability into the data, reducing the benefit of the multi-center 
design. Here we take this challenge of data variation among multiple centers as a favor to demonstrate 
the stability and reliability of RSS. We empirically show that most existing feature selection methods 
fail to obtain a consistent result because they are not able to deal with the possibly relatively high data 
variability across different centers. 
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B. Review of Randomized Structural Sparsity 

We first review the Randomized Structural Sparsity based feature selection method. Denote an MRI 
data matrix as X G where n is the number of samples and p is the number of voxels with n <^p, 

and corresponding classification labels y G where the binary classification and y* G {1,-1} is 

considered here. We take the following sparse logistic regression for classification as the example to 
describe the main idea of our algorithm. 

n 

mm||w||i + A^log(l + exp(-yi(Xfw + c))) (2) 

i=l 

where Xj denotes the z-th row of X G y G R"^^^ is the labeling information containing the 

classification information of each row of X; w G R^^^ is the weight vector for the voxels; c is the 
intercept (scalar). The voxels corresponding to Wj of large absolute value will be considered as the 
discriminative voxels. 

Structured sparsity models beyond the plain models have been proposed to incorporate more 
structured constraints on the solution |[32ll . |l54j|, |[55ll . The common way to make use of the clustering or 
grouping structure is the following group sparsity induced norm induced model |[5^ . 

n 

minY] ||wg ||2 + A y]log(l + exp(-yj(w^Xi + c))), (3) 

g^Q i=l 

where Q is the grouping information. However, the obtaining of appropriate Q might be difficult in 
practice, and the final results might be too biased by the grouping information Q. 

Recently, stability selection Il44ll based on the plain model has been widely applied ll50l . ISTl . Il52l . 
due to its finite sample control of the false positives. In addition, it makes the choice of the regularization 
parameter insensitive. However, it fails to make use of the prior structural information of the discriminative 
voxels, and may result into a big false negative rate in order to keep low false positive rate. 

Therefore, we proposed “randomized structural sparsity” in Il5^ . which incorporates the spatial struc¬ 
tural knowledge of voxels into the stability selection framework. It aims to achieve low false positive rate 
and false negative rate simultaneously. In the paper, we will further reveal its stability and reliability via 
the multi-center data analysis, i.e. a consistent biomarker can be found across different sites due to its 
robustness to relatively small data perturbation. It is of important significance for the reliable scientific 
truth. While “randomized structural sparsity” is a general concept, it has a specific implementation for 
voxel-wise MRI data anlaysis, named “Constrained Block Subsampling”. 

“Constrained Block Subsampling” prefers to using the block subsampling based stability selection 
ifSTll . rather than the original reweighting based stability selection Specifically, for the training data 
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matrix X € subsampling based stability selection consists in applying the baseline, i.e. the pure £i 

regularization model such as Q, to random submatrices of X of size [n/L] x [p/F], where [] is the round 
off to the nearest integer number, and returning those features having the largest selection frequency. The 
original stability selection Il44ll can be roughly considered as a special case of it, where L = 2 and V = 1, 
except that the original stability selection reweighs each feature (voxel, here) by a random weight 
uniformaly sampled in [a, 1] where a is a positive number, and subsampling can be intuitively seen as a 
more crude version by simply dropping out randomly a large part of the features lISTll . We further proposed 
to make use of the block subsampling Il58l . which aims to replicate the correlation by subsampling blocks 
of voxels instead of scattered voxels. The intuition of “blocking” exists in the assumption that the voxels 
are partitioned into spatially contiguous homogeneous subgroups, though possibly distributed. 

Moreover, we incorporate the parcelling information of the brain into block subsampling, resulting in the 
so called “constrained block subsampling”, where the “constrained” means that the parcelling information 
will be respected. The kind of partition information can be based on either the prior anatomical knowledge 
of brain partition Il59l . or the clustering results based on the MRI data. In particular, for each cluster 
p G it may consist of either only one or several distributed localized brain regions or called partitions, 
because a cluster based on “homogeneousness” could be disrupted into several different brain areas 
|[59]l . Correspondingly, the selected voxels from the same cluster, after the common block subsamplings, 
will be considered as a subgroup, i.e., the chosen voxels lying in a cluster g ^ Q are noted as a set 
g' C g. Furthermore, in order to make sure that those small clusters can also be sampled during the 
block subsampling, we borrow some idea of “proportionate stratified sampling” IlhOl . |[6dl . i.e. the same 
sampling fraction is used within each partition, in order to reduce the false negatives, especially when 
the sizes of different partitions are of quite a range. This way, we obtain the following group-sparsity 
based recovery model. 

min V ||w'g,||2 + AVlog(l + exp(-yi(w''^X'+c))) (4) 

w' ^^^ 

g'dg&Q i&J 

where w' and X' are corresponding parts of w and X, respectively, based on the selected voxels during 
the subsampling, and ^ is a predefined or estimated partitions of the brain. J is the set of the indices 
of the selected samples of the current subsamling. 

While “constrained block subsamplings” respects the prior knowledge Q, it also provides the flexibility 
that discovered discriminative regions can be of any shape. The final selected voxels from a cluster can 
be only part of it, because the randomness of the block subsampling in the different iterations of the 
stability selection procedure, makes the selection frequency score be able to outline structures of the true 
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discriminative regions. This kind of flexibility is of importance because the group information might not 
exactly reflect the true shapes of the discriminative brain regions. The grouping information is usually 
obtained via certain clustering algorithms which often only depend on the correlation information of 
voxels. 

For the specific multi-center data analysis, the inherent variability between different centers ll62l . lIMl 
requires our algorithm to pay more attention to the reduction of variance of feature selection results, though 
it may result in certain bias increasing, due to the bias-variance dilemma or bias-variance tradeoff ll6^ . In 
general, we would like to pay a little bias to save a lot of variance. Considering dimensionality reduction 
can decrease variance by simplifying models l[65ll . we still use the “averaging” idea 11661 applied to (|4l), 
because 11671 has proved that when the variables or features were positively correlated, their average was 
a strong feature, yielding a fit with lower variance than the individual variables. Specifically, by averaging 
the voxels picked by the block subsampling lying in the same cluster as a single super-voxel, the model 
can be further reduced to the following reduced dimensional version 

min |wg/| -f Ay^log(l-f exp(-i/j(w^Xj -f c))) (5) 

g'Cg£G i£j 

where w € M'^, and q is the number of clusters, w^/ is an average of voxels in the subset g' of cluster 
g & G, and X G is the corresponding averaged X. If the j-th column of X is selected due 

to the large magnitude of wj, then its represented picked blocked voxels lying in the group pO) G G 
(J = 1,2,... ,q) of X are all counted to be selected, in the non-clustered space, and its corresponding 
score Sj will be updated (z = 1,2,... ,p). Notice that the “averaging” of sumsampling is more than a 
simple spatial smoothing, due to different sumsampling results of different stability selection iterations. 
Therefore, the boundaries of the detected discriminative regions can be still trusted to certain accuracy. 

III. Numerical Experiments 

In this paper, we will demonstrate the stability and reliability of the RSS based support identification 
method. This point will be empirically verified based on the multi-center data, which is of data variation 
among different centers. In this paper, the data comes from the Autism Brain Imaging Data Exchange 
(ABIDE)- a grassroots consortium aggregating and openly sharing 1112 existing resting-state functional 
magnetic resonance imaging (R-fMRI) data sets with corresponding structural MRI (sMRI) and phe¬ 
notypic information from 539 individuals with ASDs and 573 age-matched typical controls (TCs; 7-64 
years) (http://fcon_1000.projects nitrc.org/indi/abide/). Then we follow the sample selection principle in 
IfSOl get the final data for 763 individuals (ASDs=360; TCs=403) from 14 centers. 
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We compare with two sample t-test, I 2 logistic regression, I 2 SVM, TV-Ll ifSTI . l\ logistic regression 
and randomized logistic regression (stability selection based on logistic regression). Two sample 
t-test and £ 2 -SVM are implemented as internal functions of MATLAB. The python code of TV-Ll is 
provided by Prof. Alexandre Gramfort of Telecom ParisTech and it is under integration in the Nilearn 
package. Most of the other algorithms have been implemented in LIBLINEAR |[68l or SEEP (Sparse 
Eearning with Efficient Projections) software If6^ . 

A. Image preprocessing 

R-fMRI scans were preprocessed with SPM(www.fil.ion.ucl.ac.uk/spm/). Image preprocessing steps 
included slice-timing and motion correction, smoothing, correted R-fMRI measures were normalized to 
Montreal Neurological Institute (MNI) 152 stereotactic space (3mm^ isotropic) with linear and non¬ 
linear registrations.Einear detrend and temporal filter (0.01-0.08Hz) were then applied on the normalized 
images. Eor each participant, we generate the following voxel-wise regional metrics: Amplitude of Eow 
Erequency Eluctuations (AEEE) ITTOII . AEEE is defined as fhe mean square root of the power spectrum 
density over the low frequency band (usually 0.01,0.08 Hz). After an individual AEEE map is obtained, 
zAEEE is obtained by performing a standard z-transformation on the voxels of individual AEEE map 
within a specific mask (gray matter mask). 

Eor structural MRI (3DT1, here), we perform Voxel-based Morphometry (VBM), which involves a 
voxel-wise comparison of the local concentration of gray matter between two groups of subjects fTTl . 
All the skull-stripped and reoriented images were spatially normalized to the Montreal Neurological 
Institute (MNI) space by minimizing the residual sum of squared differences between structural MRI 
and the ICBM 152 template image. The data were then resampled to 3 x 3 x 3mm^. All these images 
were segmented into GM (grey matter), WM (white matter) and CSE (cerebrospinal fluid) using fhe 
unified segmentation algorithm with incorporated bias correction. GM images were then smoothed with 
an 8-mm smoothing kernel. In this paper, we use the preprocessed VBM data as a main feature to test 
our algorithm. 

B. Settings of Algorithms 

Eor our algorithm, we use the SEEP |[69]| software to efficiently solve the model (lU), which in fact a 
common model. Eor the selection of regularization parameters of the involved multivariate methods 
except our method, cross validation is used. We set the subsampling rates are 0.5 and 0.1 for samples 
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and voxels, respectivley. The resampling times is 200. These settings are following the common stability 
selection default settings ll44ll . 

The internal k-means function of MATLAB is used for clustering. The default settings are used. For 
example, Squared Euclidean distance is adopted. One key parameter is the number of clusters, which 
is need to be prescribed. The number of parameters are set according to the number of samples. For 
this multi-center data, each center has around 55 samples. In general, we heuristically set the number 
of clusters between twice and 5 times of the number of samples. We have tried 3 different numbers of 
clustering such as 116, 160 and 200, respectively, in order to test how sensitive our final results are with 
this number. 

The block size is another key parameter of our algorithm. It can be set according to the probably 
available prior knowledge of the size of the discriminative regions. In case of no such prior knowledge, 
we can give it a moderate value no larger than 10 x 10 x 10. In the following subsection, we try different 
size of block from 3x3x3 to 7x7x7, which are considered to be in a reasonable range based on our 
experimental experiences. We can see that the final voxel selection resulf is nof sensifive fo fhe block 
size in fhis experimenf. 

In order fo control the false positives, we need to set a threshold value to filter out uninformative voxels, 
i.e., voxels whose corresponding weights have smaller magnitude than this threshold will be considered 
as noisy voxels. The setting of a threshold value is quite difficult in general and may adopt different 
schemes in different situations. For two sample t-test method, the number is generally determined by 
setting the p-value < 0.05 as significant level. For the multivariate methods, while the cross-validation 
method is widely used, it usually works well for the cases where there is a lot of training data available 
and the prediction accuracy is the main concern. In addition, it often causes large false negative rate 
though a small false positive rate can be often achieved. In this paper, we use a very simple but flexible 
way fo sef fhe fhreshold value. If is a probabilify based mefhod suggesfed in 121. We repeaf fhe descripfion 
of fhe probability method in |[3l as follows. Specifically, considering fhat fhe enfries of w are sparse, 
we assume fhaf fhe probabilify disfribufion of fhe enfries of w is Laplacian. Using all enfries of w as 
samples, we esfimafe fhe mean, fhe variance, and fhe inverse cumulative disfribufion funcfion F-1 of fhis 
Laplacian disfribufion. We fhen define R = {i ■. |wj| > 9, i = l,,p}, where 9 is chosen as F-1 (po)^ 
Po is a given probabilify (e.g., 0.975 or more resfricfive 0.99 in fhis paper). Unlike fhe single cenfer 
dafa analysis, fhe fhreshold is nof necessarily very resfricf because fhe operation of infersecfion of fhe 
selected features between different sits helps remove false positives. Moreover, we can gradually reduce 
the po from a large value (for example, 0.99) and check the corresponding selected voxels, which are the 
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intersection of the selected voxels of each center. We have observed that when pQ is reduced to a certain 
value (for example, 0.95), the selected voxels would stay unchanged. In such cases, we have obtained 
a reasonable threshold. We will show this heuristic rule in the following description of the numerical 
results. 

C. Evaluation Criteria 

We would like to demonstrate that our method can achieve a consistent feature selection result across 
different centers, due to our incorporation of ideas of stability selection and structural sparsity. Since 
there is no ground truth for evaluation, the confidence level of a selected relevant voxel depends on the 
number of centers where it is commonly selected. Since the feature selection algorithm is performed on 
each center individually, a voxel appearing at the voxel selection results of many centers is more likely 
to be true informative voxel. Therefore, we will show and compare the consistent selected voxels among 
different centers of all these involved algorithms. 

In addition, in order to further demonstrate that these intersected regions discovered by our algorithm are 
stable, we perform a false positive estimation scheme based on the permutation test and cross validation, 
which is proposed in ll72l . It aims to calculate the ratio of false positives among all the finally selecfed 
voxels. 

Finally, we also presenf fhe predicfion power of fhe selecfed voxels. The defails are in fhe following 
subsecfion ITTT-GI 

D. Clustering Results and Feature Selection Results 

As mentioned before, fhe proposed RSS challenged fhe limifafions of sfabilify selection. The original 
sfabilify selection fails fo explicifly make use of fhe assumpfion fhaf fhese fargefed voxels are oflen 
spatially contiguous and resulfs in disfribufed brain regions. We used K-means fo clusfer fhe voxels info 
groups fo model fhe spatial consfrainfs. We need fo emphasize fhaf fhe clusfers resulfed from K-means 
can nol accurafely model fhe underline homogeneify of fhe brain regions. However, due fo fhe adoption 
of subsamplings, we can reduce fhe bias of fhe final fealure selection resulfs caused by K-means-based 
clusfering. Figure [T] is fhe K-means clusfering resulf and corresponding feafure selecfion resulf on fMRI 
dafa of fhe SDSU cenfer. We can see fhe resulfs of K-means clusfering are reasonable and roughly mafch 
fhe prior knowledge of brain segmenfafion and infegrafion fo some degree. In addifion, while fhe clusfering 
information helps reduce fhe variance, fhe final selected brain regions do nol necessarily exacfly follow 
fhe grouping based on fhe K-means clustering. This way, fhe possible bias caused by fhe clusfering is 
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Fig. 1. The upper row is the clustering result by K-means with the number of clusters being 200, for the fMRI data of SDSU 
center. The lower row is the brain map of the selected voxels on the same center, with the threshold parameter po = 0.975. 



Fig. 2. The upper row is the clustering Results by Kmeans with the number of clusters being 200, for the sMRI data of NYU 
center. The lower row is the brain map of the selected voxels on the same center, with the threshold parameter po — 0.975. 


expected to be reduced. We can also observe similar phenomenon from sMRI data, as showed in Figure 

El 

We have also tested 3 different numbers of clusters such as 116, 160 and 200. Figure El and Figure |4] 
are the results of our algorithm applied to sMRI and R-fMRI data, respectively. We have observed that 
the final support identification results are not very sensitive to the number of clusters. The revealed brain 
regions are the same, though they might be of slightly different size for different settings. 
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Fig. 3. Selected overlapped voxels of our method shared by all 14 centers for the sMRI, i.e., 3DT1 data. We tested different 
settings of number of clusters (clusters no. ) and different po (prob.level) to demonstrate the robustness of our algorithm. 


E. Feature Selection Results Corresponding to Different Block Sizes 

Now we are checking the robustness of our algorithm in terms of different block sizes when performing 
block subsampling. While we do not have the formula for the optimal block size, we show that the 
final feature selection result is not sensitive to the block size. We tested 5 different block sizes, i.e. 
3x3x3 = 27, 4x4x4 = 64, 5x 5x = 125, 6 x 6 x 6 = 216 and 7 x 7 x 7 = 343. Figure |5] is 
the feature selection result when our algorithm is applied to the multi-center fMRI data set. We can see 
that the results corresponding to different block sizes only have a small difference in term of the size of 
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Fig. 4. Selected overlapped voxels of our method shared by all 14 centers for the fMRI data. We tested different settings of 
number of clusters (clusters no. ) and different po (prob.level) to demonstrate the robustness of our algorithm. 


detected areas. Moreover, they all indicate the same hrain locations. The similar conclusion can also he 
drawn for sMRI data, as suggested hy Figure 

F. Comparison of Consistency of Results of Different Algorithms 

All the univariate and multivariate pattern feature selection methods can identified certain number of 
possible disease-related voxels for each center. The number has been listed in Tables J] and |IIJ which 
corresponds to the results on the sMRI and fMRI, respectively. Besides TV-Ll, the multivariate methods 
mostly used different 9 values (denoted as prob.level), i.e. 0.99, 0.975 and even 0.96 when applying the 
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Fig. 5. Feature selection results of our algorithm corresponding to different block sizes for fMRI data. 

probability method to determine the selected voxels. For TV-Ll, we have the third-party python software 
and we use its internal default way to set the threshold value, which is partially based on cross-validation. 
The overlapped(5), represents the number of the overlap of selected voxels across different centers, where 
S means that at least S centers share these selected voxels. As the S increases, the number of overlapped 
voxels might decrease as expected. 

The variation of MRI data between different centers brings both challenges and benefits. On one hand, 
this kind of variation makes many existing state-of-the-art algorithms likely to select quite distinct voxels 
for different centers, because they tend to select a minimum subset of features to construct a classifier 
of the best predictive accuracy and often ignore “stability” of feature selection @, IITOl . ifTTl . Therefore, 
the results across different centers could be quite different and the overlapped parts are quite few. From 
Tables J] and JI] , we can see that the overlapped voxels corresponding to the two sample t-test, L2- 
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Fig. 6. Feature selection results of our algorithm corresponding to different block sizes for sMRI data 

logistic, L2-SVM, TV-Ll, and Ll-Logistic are almost none when S is larger than 11. Notice that while 
two sample t-test is widely believed to he a stable method for single center data analysis, it is not a 
good choice for multi-center data. TV-Ll, as a recently popular feature selection method for the single 
center data analysis, does not work well for the multi-centere data analysis, partially due to the lack 
of stability scheme. As expected, randomized £i logistic regression, as a method of stability selection, 
behavior better than other alternatives. However, it is still much worse than our method, especially when 
S is large, for example, 11 or 14, partially due to the failure of explicitly making use of the prior structural 
information of the discriminative voxels. For better comparison. We also plotted the selected voxels of 
different algorithms corresponding to 5 = 5, 9,11,14, in Figures |7J [8] and |9] for the sMRI data, and in 
Figures [TOl [TT] and [12] for the fMRI data, respectively. We can see that our method can always achieve 
a consistent voxel selection result across different centers while the other state-of-the-arts mostly fail. 
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Algorithms 

ttest 

L21ogistic 

L2SVM 

LI Logistic 

TV-Ll 

Randomized LI 

Our method 

No. Clusters 







116 

160 

200 

prob.level 


0.99 

0.975 

0.99 

0.975 

0.99 

0.975 


0.99 

0.975 

0.99 

0.975 

0.99 

0.975 

0.99 

0.975 

KKI 

1303 

10873 

16438 

10454 

16337 

12 

17 

8290 

8491 

14274 

8583 

8583 

8459 

8459 

8146 

8146 

Leuven 

1590 

10440 

15426 

10788 

16517 

51 

51 

6963 

8177 

13266 

7699 

7699 

7753 

7753 

8487 

8487 

MaxMun 

1983 

10456 

15511 

11537 

17568 

63 

64 

7492 

8250 

13145 

8105 

8105 

7975 

7975 

8395 

8395 

NYU 

2160 

10856 

16327 

10687 

16395 

422 

455 

7405 

8539 

14572 

8109 

8109 

8108 

8108 

8334 

8334 

OHSU 

1643 

9508 

14368 

11158 

17148 

12 

12 

6737 

8009 

12853 

9007 

9007 

9105 

9105 

9693 

9693 

Olin 

958 

10259 

15196 

11377 

17171 

1 

1 

6332 

8260 

13343 

9013 

9013 

8932 

8932 

8777 

8777 

Pitt 

954 

10764 

16084 

11398 

17298 

8 

8 

5341 

8562 

14355 

8593 

8593 

8632 

8632 

8678 

8678 

SDSU 

1915 

10648 

15590 

11970 

17985 

6 

6 

7176 

8628 

13362 

9766 

9766 

10477 

10477 

9753 

9753 

Standford 

2457 

10970 

16309 

11061 

16869 

50 

50 

10642 

8655 

13981 

9063 

9063 

9033 

9033 

8636 

8636 

Trinity 

2423 

10246 

15355 

11351 

17556 

9 

9 

9370 

8270 

12911 

8424 

8424 

8297 

8297 

8695 

8695 

UCLA 

23643 

10659 

15921 

10952 

16746 

178 

184 

7934 

8324 

13729 

7547 

7547 

7807 

7807 

8092 

8092 

UM 

2587 

10793 

16237 

11571 

17607 

387 

413 

7043 

8628 

14212 

8230 

8230 

8621 

8621 

8855 

8855 

USM 

8298 

10523 

15402 

11577 

17181 

323 

330 

7714 

8393 

13738 

7896 

7896 

7952 

7952 

7767 

7767 

Yale 

3801 

10462 

15617 

11201 

17275 

2 

2 

5749 

8312 

12916 

9143 

9143 

8862 

8862 

8592 

8592 

overlapped(14) 

0 

0 

2 

0 

0 

0 

0 

0 

0 

1 

1436 

1436 

1586 

1586 

1432 

1432 

overlapped(ll) 

0 

75 

531 

0 

135 

0 

0 

0 

177 

1664 

4557 

4557 

4674 

4674 

4838 

4838 

overlapped(9) 

0 

703 

3430 

152 

2133 

0 

0 

0 

1101 

5924 

6420 

6420 

6526 

6526 

6643 

6643 
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Summary of Number of Selected Voxels eor the multi-center sMRI data 


Algs 

ttest 

L21ogistic 

L2SVM 

LlLogistic 

TVLl 

Randomized LI 

Our method 

No. 







116 

160 

200 

prob.level 


0.99 

0.975 

0.99 

0.975 

0.99 

0.975 


0.99 

0.975 

0.99 

0.975 

0.96 

0.99 

0.975 

0.96 

0.99 

0.975 

0.96 

KKI 

4347 

10565 

15683 

11096 

16510 

249 

258 

5699 

8380 

13286 

9174 

9174 

9174 

7448 

7448 

7448 

7652 

7652 

7652 

Leuven 

2485 

9884 

15090 

11709 

17578 

261 

273 

5868 

8079 

13027 

8293 

15210 

15210 

11964 

14248 

14248 

12720 

14998 

14998 

MaxMun 

2819 

9746 

14667 

11523 

17317 

277 

286 

5994 

8022 

12723 

8872 

8872 

8872 

10148 

10148 

10148 

9964 

9964 

9964 

NYU 

8339 

9928 

15051 

11286 

16750 

653 

696 

5604 

8153 

13480 

8290 

8290 

8290 

8055 

8055 

8055 

8181 

8181 

8181 

OHSU 

1785 

9214 

13853 

11985 

17808 

239 

248 

5938 

7837 

12300 

10831 

10831 

10831 

12604 

12604 

12604 

8077 

12040 

12040 

Olin 

1532 

10534 

15656 

11639 

17377 

189 

193 

5252 

8252 

13371 

9618 

9618 

9618 

9515 

11491 

11491 

7981 

11364 

11364 

Pitt 

1988 

9966 

15057 

11847 

17811 

296 

321 

5863 

7999 

12901 

8522 

8522 

8522 

7930 

8229 

8229 

8726 

8726 

8726 

SDSU 

3539 

10185 

15189 

12117 

18201 

222 

229 

5938 

8635 

13629 

11461 

11461 

11461 

10435 

10435 

10435 

9966 

10412 

10412 

Standford 

2486 

10806 

16059 

11772 

17467 

211 

217 

5853 

8529 

13836 

9616 

9616 

9616 

9400 

9400 

9400 

11640 

11640 

11640 

Trinity 

2742 

9874 

14815 

12164 

18045 

349 

369 

5598 

8104 

12848 

8385 

8385 

8385 

8964 

8964 

8964 

8447 

8447 

8447 

UCLA 

1166 

9703 

14786 

11818 

17501 

486 

500 

5516 

8046 

13191 

8507 

10002 

10002 

8895 

9896 

9896 

7920 

9854 

9854 

UM 

3545 

10645 

16123 

11556 

17277 

508 

537 

5706 

8423 

13890 

9417 

12500 

12500 

10066 

12204 

12204 

8679 

11344 

11344 

USM 

5238 

9618 

14456 

12003 

17765 

488 

511 

6006 

8183 

12954 

7359 

7359 

7359 

8892 

8892 

8892 

8786 

8786 

8786 

Yale 

2907 

9112 

13684 

11314 

16652 

175 

180 

5874 

7609 

12026 

11033 

11033 

11033 

10598 

10598 

10598 

11630 

11630 

11630 

OL(14) 

0 

0 

2 

0 

1 

0 

0 

0 

I 

24 

645 

721 

721 

688 

771 

771 

801 

857 

857 

OL(ll) 

0 

104 

583 

4 

153 

0 

0 

0 

158 

1129 

2708 

3118 

3118 

2946 

3224 

3224 

3131 

3596 

3596 

OL(9) 

0 

872 

3344 

153 

2298 

0 

0 

0 

892 

3667 

4558 

4871 

4871 

4896 

5180 

5180 

4968 

5427 

5427 


—-----TABLE 11—----- 

Summary of Number of Selected Voxels eor the multi-center eMRI data 


On the other hand, the variation of data between different centers can help us to verify the reliahility of 
the selected discriminative voxels ll73l . If these selected voxels are all shared hy these involved centers, 
a better reliability of them is expected, even as the potential biomarker in the possible clinical diagnosis. 
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Despite the computationally challenge, our algorithm incorporating the ideas of stability selection and 
structural sparsity successfully revealed significant number of the shared selected voxels. 

Moreover, we would like to point out that there exists a practical simple way to set the threshold value 
for feature selection for our algorithm. We have empirically observed that the sorted absolute values of the 
weights (from large to small) first gradually change then suddenly decreases dramatically and produce a 
gap. Usually we can set the threshold value at the location of the gap, and it is expected to achieve a good 
control of false positives. A similar phenomenon has been observed and exploited in our previous related 
work about sparse signal recovery llTdl . Specifically, lef us look af fhe Table U For the two threshold 
values corresponding to probability levels po = 0.99 and po = 0.975, respectively, the same discriminative 
voxels are obtained for our algorithm as each of the center. That is to say, the sorted weights of voxels 
indeed have a gap. Using this threshold value, we obtained an interpretable discriminative brain regions 
and we will give further analysis of these regions in the following subsection. Table JI] tells a similar 
story, except the gap is around the threshold values corresponding to the probability levels pQ = 0.975 
and Po = 0.96. 


G. Subsequent Analysis of Feature Selection Results 

We have observed that the overlap of selected voxels by our algorithm is significantly higher. We 
believe that most of the selected voxols are true positives because the false positive usually are unlikely 
to repeat on every center. Furthermore, we would like to further estimate the ratio of false positives from 
another point of view based on permutation test and cross validation, as suggested in |[72l . The results are 
summarized in Table ITm For sMRI, the estimated false positives ratios are all smaller than 3%, for the 
overlap of all the 14 centers, for all the settings using different clustering numbers (116, 160, 200) and 
different probability levels (0.99 and 0.975). For fMRI, the estimated false positives ratios are slightly 
larger, but still smaller than 6%. Notice that these numbers are only rough estimations for reference, not 
necessarily accurately reflecting the ground truth. In addition, we provide the classification performances 
when using those identified voxels as fealures fo consfruct a classifier. Nofice fhaf for many alfemafive 
mefhods, they could not achieve consistent results if S is large and so we have to reduce S to 5 or even 2. 
For our method, we use S' = 11. When performing the classification, we pool all samples from 14 centers 
together and we have in total 763 samples. We randomly pick 700 as the training samples and the rest 63 
as the test samples and we repeat this procedure for 100 times. We use the £2 logistic regression as the 
classifier. The classification accuracy is summarized in Tables |IV] and |V] for sMRI data and fMRI data, 
respectively. While the selected voxels by our algorithm can achieve the best classification performance. 
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they are generally not very high. A possible reason is that a simple logistic classifier can not deal with 
the relatively high data variation of these samples from different centers and we might consider more 
sophisticated classifiers in fhe future. 



Our method 


Our method 

No. Clusters 

116 

160 

200 

No. Clusters 

116 

160 

200 

prob.level 

0.99 

0.975 

0.99 

0.975 

0.99 

0.975 

prob.level 

0.99 

0.975 

0.99 

0.975 

0.99 

0.975 

No. Totally Selected 

1436 

1436 

1586 

1586 

1432 

1432 

No. Totally Selected 

645 

721 

688 

771 

801 

857 

No. Estimated False Positives 

25 

25 

37 

37 

23 

23 

No. Estimated False Positives 

24 

33 

27 

39 

45 

51 


Number of Selected Voxels (across 14 centers) and Estimated False Positives. The left part corresponds 

TO THE MULTI-CENTER SMRI DATA AND THE RIGHT PART CORRESPONDS TO THE MULTI-CENTER FMRI DATA 


ttest 

121ogistic 

L2SVm 

LlLogistic 

TVLl 

Randomized LI 

Ours 

overlapped(5) 

overlapped(l 1) 

overlapped! 11) 

overlapped(2) 

overlapped(5) 

overlapped! 11) 

overlapped! 11) 

0.6046 

0,6054 

0.6121 

0.6012 

0.6113 

0.5941 

0.6332 


TSE 


TTfV 


Prediction power of the selected voxels for sMRI data 


ttest 

121ogistic 

L2SVm 

LlLogistic 

TVLl 

Randomized LI 

Ours 

overlapped!5) 

overlapped! 11) 

overlapped!! 1) 

overlapped!2) 

overlapped!5) 

overlapped! 11) 

overlapped! 11) 

0.6062 

0.5908 

0.6145 

0.5944 

0.5997 

0.5801 

0.6283 


---'f'ABLH V-- 

Prediction power of the selected voxels for fMRI data 


H. Discussion Based on Biomarkers from Multimodal Data 

We aim to show that these discovered hiomarkers are interpretahle and correspondingly support the 
reliahility and stability of our algorithm. According to Figures [3] and |4] , we found the grey mat¬ 
ter abnormalities in precuneus, posterior cingulate cortex, insular, middle temporal gyrus-R, fusiform 
gyrus, hippocampus, parahippocampal gyrus may have the potential be the related hiomarkers of sMRI 
data(VBM) for distinguishing between the autism and the healthy controls. And with respect to the fMRI 
(zalff), medial prefrontal, precuneus, temporal pole, superior and medial frontal gyrus be the revealed 
regions as the discriminations of ASD. The potential structural biomarker regions are mainly concentrated 
in the DMN (default mode network), such as precuneus and posterior cingulate cortex(PCC) which are 
regarded relating to primary responsible for autobiographical memory and self-reference processing. For 
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example, precuneus and PCC are primarily responsible for autobiographical memory and the self-reference 
processing. Significant abnormality in insula was also revealed, it is a critical component of SN(salience 
network) and would be involved in switching between central-executive and DMN 1751 . And the SN is 
thought to regulate dynamic changes in other networks and the damage to the structure of SN should 
disrupt the regulation of associated networks such as DMN 1761 . Besides the brain regions mentioned 
above, some other regions such as fusiform gyrus, hippocampus and middle temporal gyrus-R which are 
reported in previous studies fJl\ . In our research, those overlapped regions we identified as discriminatied 
feafures, are found across all the 14 centers. Thus we can say that they may be the ASD-related stable 
biomarkers for the diagnoses and further analysis, according to the characters of our algorithm purposed 
and discussed before. Similar to the structural data, the functional abnormalities of the ASD mainly lie 
in regions of medial prefrontal (mPFC), precuneus and temporal pole (above three are the components of 
the DMN) and anterior cingulate(ACC). mPFC is in charge of social cognition associated with self and 
others ITU. In addition, the alterations of ACC, a part of the brains limbic system, may be relevant to 
the character of autism 1791 . Accordingly, the consistent overlapped features in ASD from multi-center 
data further confirmed fhe functional abnormalities in ASD llSOl . lEH- 

IV. Conclusion and Future work 

In this paper, we consider a commonly existent problem in mental disease diagnosis or congtive 
science based on brain imaging, i.e. locating the discriminative or activitaed brain regions correspoing to 
the specific decease or certain external stimulus including looking at multimedia materials. We revisit our 
recently proposed algorithm which incorporates the ideas of stability selection and structural sparsity, and 
aims to show its abilitity to reliably and stably find ouf these intrested brain regions despite the inherent 
data variations or noise. We consider to demonstrate our point via the multi-center MRI data analysis, 
which is also a promising research direction by itself. Specifically, our algorithm helps us to discover 
a set of consistent selected voxels across different centers and these voxels are quite interpretable and 
can be possible biomarkers for the correspoinding medical diagnosis. Considering that most of existing 
algorithms fail to overcome the difficulty brought by the variation between centers and therefore are 
not able to get a consistent result, our algorithm also seems to be promising for the multi-center data 
analysis from the algorithmic point of view. However, some therotical analysis of the performance of 
our algorithm will be required, considering most of current study are more on the empirical side. We 
will further test the performance of our algorithm through more kinds of multi-center data and feature 
types in the future work, epecially in the cognitive experiments including the study of brain responses 
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Fig. 7. Selected overlapped voxels of our method shared hy at least 9 centers for the sMRI data, where the prohahility value 
is 0.99. We compare the results by different methods. We can see our method is the best which can find out significantly 
interpretable discriminative structures. Randomized method and I 2 logistic regression are the second best and all the rest 
methods fail to find out consistent selected voxels among different centers. 


to multimedia. 


Acknowledgment 

Thanks to Prof. Alexandre Gramfort of Telecom ParisTech for kindly providing us with TV-Ll code, 
which is under integration in the Nilearn package. 


June 9, 2015 


DRAFT 





























JOURNAL OF UTbX CLASS FILES, VOL. 11, NO. 4, DECEMBER 2012 


23 



Fig. 8. Selected overlapped voxels of our method shared hy at least 11 centers for the sMRI data, where the prohahility value 
is 0.99. We compare the results by different methods. We can see our method is the best in terms of revealing interpretable 
discriminative structures. 
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